Bag of Words

From text to numbers

Luis Sattelmayer

2025-01-14

Converting text to numbers

  • “Text as data” means transforming text into numbers
  • There are different ways to do this
  • We start with the Bag of Words representation

Bag of Words (BoW)

Bag-of-words featurization quantifies the frequency of words in text documents so that machine learning models can process them

  • An unordered collection of all words present in a text
  • Relies on raw word frequency counts, as the sketch below illustrates
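
As a minimal illustration (a sketch in plain Python, not tied to any particular library), the “bag” for a document is just a frequency table over its tokens:

    # A minimal sketch: the "bag" is just a word-frequency count
    from collections import Counter

    doc = "the president spoke about the climate"
    bag = Counter(doc.lower().split())
    print(bag)
    # Counter({'the': 2, 'president': 1, 'spoke': 1, 'about': 1, 'climate': 1})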


Limits of BoW

  • Word correlation:
    • context-blind
    • “election” and “president” co-occur more often than “president” and “poet”
    • simple individual term frequencies omit this relationship
  • Compound words:
    • without accounting for words that belong together, every word is judged individually
    • “White House” is treated as “white” + “house”; “Donald Trump” becomes “donald” + “trump”
    • solution: n-grams (see the n-gram slide below)

Limits of BoW

  • Polysemous words:
    • multiple meanings: climate (global warming) \(\neq\) climate (the tense climate of a debate)
    • CON-tent \(\neq\) con-TENT
  • Sparsity:
    • each word is a feature, i.e. a dimension of the model
    • computationally inefficient, since most entries are zero

Solutions to the Limitations of BoW

  • Bag of n-grams:
    • at minimum bigrams: “New York”, “Eiffel Tower”, “Donald Trump”, “White House”
    • still not fully informative: “on the”, “will be”
  • Normalization/preprocessing (see later slides):
    • stemming
    • removing punctuation
    • removing stopwords

N-grams

n-gram Size        Description                           n-gram Examples
Unigram (1-gram)   Single word                           “Natural”, “language”, “processing”
Bigram (2-gram)    Sequence of two consecutive words     “Natural language”, “language processing”
Trigram (3-gram)   Sequence of three consecutive words   “Natural language processing”
4-gram             Sequence of four consecutive words    “Natural language processing tasks”
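
Extracting n-grams from a token list takes only a few lines; the sketch below (plain Python, with an illustrative helper name) slides a window of size n over the tokens:

    # A sketch: slide a window of size n over the token list
    def ngrams(tokens, n):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "natural language processing tasks".split()
    print(ngrams(tokens, 2))
    # ['natural language', 'language processing', 'processing tasks']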

Zipf’s Law

When we rank observations by their frequency, the frequency of a specific observation is inversely proportional to its rank.

  • The most frequent word in a language appears roughly twice as often as the second most frequent, three times as often as the third, and so on
  • the appears about 7% of the time in the Brown Corpus of American English, while of accounts for about 3.5%
  • This means: the most common words are the least informative
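
In formula form, with \(f(r)\) the frequency of the word at rank \(r\):

\[
f(r) \propto \frac{1}{r^{\alpha}}, \qquad \alpha \approx 1
\]

On a log-log plot, rank against frequency is therefore close to a straight line.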


Preprocessing/Normalizing

  • Tokenization
    • splitting text into smaller units of analysis
    • e.g., sentences, quasi-sentences, or words can each be tokens
  • Lowercasing
    • “Cat” and “cat” are not the same for Bag of Words
  • Removing Stopwords
    • common words that carry little semantic meaning: “is”, “of”, “the”, “and”
    • focuses the representation on meaningful words and reduces its dimensionality
  • Removing Punctuation
    • stripping commas, periods, and so on
  • Stemming
    • reducing words to their root or base form by removing suffixes: “running” → “run” (see the pipeline sketch below)
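
Put together, a minimal pipeline might look like this (a sketch assuming NLTK is installed and its “punkt” and “stopwords” resources have been fetched via nltk.download()):

    import string
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def preprocess(text):
        tokens = word_tokenize(text.lower())                         # tokenize + lowercase
        tokens = [t for t in tokens if t not in string.punctuation]  # remove punctuation
        tokens = [t for t in tokens if t not in stop_words]          # remove stopwords
        return [stemmer.stem(t) for t in tokens]                     # stem to root form

    print(preprocess("The cats are running in the White House."))
    # ['cat', 'run', 'white', 'hous']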

Stemming vs Lemmatizing

Aspect     Lemmatization                            Stemming
Process    Context-aware, dictionary-based          Rule-based, heuristic chopping
Output     Produces valid dictionary words          May produce non-words (e.g., “studies” → “studi”)
Context    Considers grammatical structure          Ignores grammatical context
Accuracy   High (linguistically accurate)           Lower (prone to errors)
Speed      Slower due to dictionary lookup          Faster because it is rule-based
Examples   “running” → “run”, “better” → “good”     “running” → “run”, “better” → “better”
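
The contrast is easy to see in code (a sketch assuming NLTK with its “wordnet” resource downloaded; the part-of-speech tags are supplied by hand here):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word, pos in [("running", "v"), ("studies", "v"), ("better", "a")]:
        print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos=pos))
    # running -> run | run
    # studies -> studi | study
    # better -> better | good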

Term-Document Matrix

A matrix where rows represent terms (e.g., words, phrases, or n-grams) and columns represent documents.

  • Each cell contains a value representing the occurrence or importance of the term in the corresponding document
  • Common Values in Cells:
    • Raw Term Frequency (TF): Counts how often a term appears in a document
    • Binary Presence: 1 if the term appears in the document, otherwise 0
    • Weighted Measures: TF-IDF (Term Frequency-Inverse Document Frequency)
  • The matrix is typically sparse (most cells are zero), which is inefficient for large corpora

Document-Term Matrix

A matrix where rows represent documents and columns represent terms (e.g., words, phrases, or n-grams)

  • Each cell contains a value that represents the occurrence or importance of the term in the document

  • Can be used for the same purposes as a TDM

  • Very sparse when the vocabulary is large

Here is a TDM:

Term          Doc1   Doc2   Doc3
climate          5      2      0
president        3      0      1
immigration      2      4      6
economy          1      1      3

Here is a DTM:

Document   climate   president   immigration   economy
Doc1             5           3             2         1
Doc2             2           0             4         1
Doc3             0           1             6         3
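
Such a matrix can be built in a few lines; the sketch below uses scikit-learn's CountVectorizer (a recent version, for get_feature_names_out; the documents are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "The president spoke about climate and immigration.",
        "Immigration and the economy dominated the debate.",
        "Economic growth and immigration reform were discussed.",
    ]

    vectorizer = CountVectorizer()          # tokenizes, lowercases, counts
    dtm = vectorizer.fit_transform(docs)    # sparse document-term matrix
    print(vectorizer.get_feature_names_out())
    print(dtm.toarray())                    # rows = documents, columns = terms
    # The TDM is simply the transpose: dtm.T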

TDM vs. DTM

Feature             Term-Document Matrix (TDM)                    Document-Term Matrix (DTM)
Rows                Terms (words or phrases)                      Documents
Columns             Documents                                     Terms (words or phrases)
Primary Use         Explore term trends across documents          Explore document trends across terms
Transposition       Can be transposed to form a DTM               Can be transposed to form a TDM
Applications        Topic modeling, text mining                   Text classification, semantic analysis
Example Structure   Row = “climate”, Column = “Doc1”, Value = 5   Row = “Doc1”, Column = “climate”, Value = 5
Dimensionality      Wide structure for large vocabularies         Tall structure for large document collections

tf-idf

  • In standard bag-of-words models, semantically irrelevant words often have the highest term frequencies, and therefore the greatest weight in a model
  • Term frequency–inverse document frequency (tf-idf) can correct this
  • tf-idf accounts for a word's prevalence across every document in the corpus
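
In its most common form, the weight of term \(t\) in document \(d\) is

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \frac{N}{\text{df}(t)}
\]

where \(\text{tf}(t, d)\) counts how often \(t\) occurs in \(d\), \(N\) is the number of documents in the corpus, and \(\text{df}(t)\) is the number of documents containing \(t\) (libraries differ in smoothing and normalization details).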

tf-idf

  • High tf-idf: the term is frequent in the document but rare in the overall corpus → more unique or important for that document

  • Low tf-idf: the term is either frequent across the corpus (e.g., stopwords like “the”, “and”) or infrequent in the document, meaning it has little distinguishing power

  • Input for machine learning models:
    • text classification (e.g., spam detection or sentiment analysis, with models such as random forests)
    • document clustering
  • Information Retrieval:
    • search engines
    • document matching (using cosine similarity → Thursday)
  • Anomaly Detection
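
A sketch with scikit-learn's TfidfVectorizer, reusing the invented documents from the DTM example above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "The president spoke about climate and immigration.",
        "Immigration and the economy dominated the debate.",
        "Economic growth and immigration reform were discussed.",
    ]

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)   # rows are tf-idf weighted document vectors
    # Terms concentrated in one document get high weights; terms appearing in
    # every document (like "immigration" here) are down-weighted
    print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))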

Use cases of BoW

  1. Scaling
  2. Topic Modeling
  3. Text classification

Scaling

Quantify and compare the positions of documents (e.g., texts, articles, speeches) along one or more dimensions based on their linguistic content

  • Use cases in Political Science: Map political speeches or manifestos along ideological dimensions (e.g., left-right, liberal-conservative)

Scaling application

An uninformed scaling application by myself a couple of years ago

Summary: Bag of Words

  • Definition: represents text as an unordered collection of words, ignoring grammar and word order
    • focuses on word frequency
    • treats documents as a “bag” of individual terms
  • Strengths:
    • simple and intuitive for feature extraction
    • effective for basic text analysis tasks like classification and clustering
  • Limitations: context blind, sparse representation, polysemy, compound words
    • apply preprocessing: tokenization, stopword removal, stemming, lemmatization
  • Applications:
    • n-grams, tf-idf, scaling documents along dimensions (e.g., ideology), topic modeling and clustering